Automatically Learned vs. Hand-crafted Text Analysis Rules
Figure 1: Output from a "Management Succession" Text
Abstract
As vast quantities of on-line text become available, there is an increasing need for systems that automatically analyze the conceptual content of natural language text. Systems that operate on narrowly defined domains show promise, but require a different set of domain-specific rules for each application. This paper describes CRYSTAL, a system that learns text analysis rules automatically from examples. Rules induced by CRYSTAL achieve performance approaching that of hand-crafted rules. CRYSTAL has a particularly efficient learning algorithm that is not improved by more extensive search. This offers a practical alternative to time-consuming manual knowledge engineering for each new domain.

This material is based on work supported in part by the National Science Foundation, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623 and in part by NRaD Contract Number N66001-94-D-6054.

1 Domain-specific Text Analysis

With the increasing amounts of on-line text available, the need is growing for automated text analysis systems that go beyond keywords to extract the conceptual content of the text. This requires a system that can reliably extract both explicitly stated information and information that can be reasonably inferred. General purpose text understanding is still beyond the reach of current technology, but considerable progress has been made by restricting the problem to a predefined set of concepts in a narrowly defined domain.

A text analysis system with the appropriate domain-specific knowledge sources can identify references to information that is of interest in a particular domain. A domain consists of a corpus of texts together with a set of concepts to be identified in those texts. The target concepts in a medical domain might be references to symptoms and diagnoses in patient records. In a collection of Wall Street Journal articles, the target concept might be management succession events: persons moving into top management positions in corporations and persons moving out of those positions. The ARPA-sponsored Sixth Message Understanding Conference [MUC-6 1995] used such a "Management Succession" domain. This domain is illustrated by Figure 1.
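To make the idea of a hand-crafted, domain-specific rule concrete, the following is a minimal Python sketch for the management succession concept described above. It is an illustration only and is not CRYSTAL's rule representation: the `SuccessionEvent` record, the regular expressions, the `RULES` list and the `extract_events` function are all hypothetical names introduced here, and the patterns cover only two simple sentence shapes.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SuccessionEvent:
    """Hypothetical record for one management succession event."""
    person: str
    position: str
    organization: str
    direction: str  # "in" = moving into the position, "out" = moving out


# Assumed, deliberately simplistic building blocks for the toy rules.
_NAME = r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)+)"
_POST = r"(?P<position>chief executive officer|chief executive|president|chairman)"
_ORG = r"(?P<organization>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"

# Two hand-written rules: one for persons moving in, one for persons moving out.
RULES = [
    ("in", re.compile(rf"{_NAME} was (?:named|appointed|elected) {_POST} of {_ORG}")),
    ("out", re.compile(rf"{_NAME} (?:resigned|retired|stepped down) as {_POST} of {_ORG}")),
]


def extract_events(text: str) -> List[SuccessionEvent]:
    """Apply every rule to the text and collect the events it matches."""
    events = []
    for direction, pattern in RULES:
        for match in pattern.finditer(text):
            events.append(
                SuccessionEvent(
                    person=match["person"],
                    position=match["position"],
                    organization=match["organization"],
                    direction=direction,
                )
            )
    return events


if __name__ == "__main__":
    sample = (
        "John Smith was named president of Acme Corp. "
        "Mary Jones resigned as chairman of Widget Inc."
    )
    for event in extract_events(sample):
        print(event)
```

Each new phrasing ("succeeds", "will step down", passive constructions, and so on) would need another hand-written pattern, which illustrates the manual knowledge engineering effort that learning rules from examples is meant to reduce.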
Related resources
Crystal: Learning Domain-specific Text Analysis Rules
An enormous amount of knowledge is needed to infer the meaning of unrestricted natural language. The problem can be reduced to a manageable size by restricting attention to a predefined set of concepts in a specific domain. Two widely different domains are used to illustrate this domain-specific approach. One domain is a collection of Wall Street Journal articles in which the target concept is mana...
The CRYSTAL algorithm: Initialize Dictionary and Training Instances
Domain-specific text analysis requires a dictionary of linguistic patterns that identify references to relevant information in a text. This paper describes CRYSTAL, a fully automated tool that induces such a dictionary of text extraction rules. We discuss some key issues in developing an automatic dictionary induction system, using CRYSTAL as a concrete example. CRYSTAL derives text extraction r...
A Model for Extracting Information from Text Documents, Based on Text Mining in the E-learning Domain
As computer networks become the backbone of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts or new hypotheses by automatically extracting information from different written documents. T...
Efficient algorithm for Context Sensitive Aggregation in Natural Language generation
Aggregation is a sub-task of Natural Language Generation (NLG) that improves the conciseness and readability of the text output by NLG systems. To date, approaches to the aggregation task have been predominantly manual (manual analysis of a domain-specific corpus and development of rules). In this paper, a new algorithm for aggregation in NLG is proposed, that learns context sensitive a...
Building a Machine Learning Based Text Understanding System
Text understanding systems are approaching the point of being a practical technology as long as the system is trained for a narrowly defined domain. Machine learning and statistical approaches can minimize the effort involved in adapting a text understanding system to a new domain. This paper presents a system whose goal is deep understanding, limited only by the necessity of designing a formal...
Publication date: 1997